
[SPARK-18243][SQL] Port Hive writing to use FileFormat interface #16517

Closed
wants to merge 3 commits into apache:master from cloud-fan:insert-hive

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

Inserting data into Hive tables has its own implementation that is distinct from data sources: InsertIntoHiveTable, SparkHiveWriterContainer and SparkHiveDynamicPartitionWriterContainer.

Note that one other major difference is that data source tables write directly to the final destination without using a staging directory, and Spark itself then adds the partitions/tables to the catalog. Hive tables actually write to a staging directory and then call the Hive metastore's loadPartition/loadTable function to load that data in, so we still need to keep InsertIntoHiveTable to hold this special logic. In the future, we should consider writing to the Hive table location directly, so that we don't need to call loadTable/loadPartition at the end and can remove InsertIntoHiveTable.

This PR removes SparkHiveWriterContainer and SparkHiveDynamicPartitionWriterContainer, and creates a HiveFileFormat to implement the write logic. In the future, we should also implement the read logic in HiveFileFormat.
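
A minimal sketch of the two-phase flow described above, using simplified stand-in types (the MetastoreClient trait, writeFiles callback, and HiveInsertFlowSketch name are illustrative, not Spark's internals): data files go to a staging directory first, then the metastore's loadTable/loadPartition pulls them into the table or partition.

```scala
import java.nio.file.{Files, Path}

// Hypothetical stand-in for the metastore calls mentioned above.
trait MetastoreClient {
  def loadTable(staging: Path, table: String, overwrite: Boolean): Unit
  def loadPartition(staging: Path, table: String, spec: Map[String, String], overwrite: Boolean): Unit
}

object HiveInsertFlowSketch {
  def insert(
      writeFiles: Path => Unit,          // phase 1: e.g. a FileFormat-based writer
      metastore: MetastoreClient,
      table: String,
      partitionSpec: Map[String, String],
      overwrite: Boolean): Unit = {
    val staging = Files.createTempDirectory("hive-staging") // assumed staging location
    writeFiles(staging)                                      // write data files to staging
    if (partitionSpec.isEmpty) {
      metastore.loadTable(staging, table, overwrite)         // phase 2: load into the table
    } else {
      metastore.loadPartition(staging, table, partitionSpec, overwrite)
    }
  }
}
```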

How was this patch tested?

existing tests

@@ -99,7 +99,7 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String)
}

private def getFilename(taskContext: TaskAttemptContext, ext: String): String = {
// The file name looks like part-r-00000-2dd664f9-d2c4-4ffe-878f-c6c70c1fb0cb_00003.gz.parquet
// The file name looks like part-00000-2dd664f9-d2c4-4ffe-878f-c6c70c1fb0cb_00003.gz.parquet
Contributor Author

@rxin @jiangxb1987 do you know why we have a random UUID in the middle of the file name? We pass it into HadoopMapReduceCommitProtocol as jobId. Are we trying to avoid conflicts when multiple jobs write to the same path?

Contributor

yes

Contributor Author

@cloud-fan Jan 10, 2017

Then we fixed a potential problem: the previous Hive table writer doesn't handle this case: https://github.com/apache/spark/pull/16517/files#diff-92b05808926122b334c2fdd2fd1e4221L103

Member

After more reading, the existing Hive table writers do not have such an issue. They rely on a unique TaskAttemptID, which is used in the call to FileOutputFormat.getTaskOutputPath.

@@ -128,34 +128,32 @@ object FileFormatWriter extends Logging {
.getOrElse(sparkSession.sessionState.conf.maxRecordsPerFile)
)

SQLExecution.withNewExecutionId(sparkSession, queryExecution) {
Contributor Author

I'll submit a different PR for this bug fix.

Contributor

please do

override def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
case InsertIntoTable(table: MetastoreRelation, partSpec, query, overwrite, ifNotExists)
if hasBeenPreprocessed(table.output, table.partitionKeys.toStructType, partSpec, query) =>
InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists)
Contributor Author

To use FileFormatWriter, we need to make InsertIntoHiveTable a LogicalPlan like InsertIntoHadoopFsRelation, which means we need to convert InsertIntoTable to InsertIntoHiveTable during analysis.


override def commitJob(jobContext: JobContext, taskCommits: Seq[TaskCommitMessage]): Unit = {
import HiveTableCommitProtocol.SUCCESSFUL_JOB_OUTPUT_DIR_MARKER
// This is a hack to avoid writing _SUCCESS mark file. In lower versions of Hadoop (e.g. 1.0.4),
Contributor Author

cc @liancheng @yhuai, do you have more context on this? It would be great if we could remove this hack, and then we could remove this class.


SparkQA commented Jan 9, 2017

Test build #71083 has finished for PR 16517 at commit ca67a38.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 9, 2017

Test build #71085 has finished for PR 16517 at commit 0039f2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

rxin commented Jan 9, 2017

Maybe a better title is "Port Hive writing to use FileFormat interface"?

@cloud-fan changed the title [SPARK-18243][SQL] Implement InsertIntoHiveTable with FileCommitProtocol and FileFormatWriter [SPARK-18243][SQL] Port Hive writing to use FileFormat interface Jan 10, 2017
}

override def commitJob(): Unit = {
// This is a hack to avoid writing _SUCCESS mark file. In lower versions of Hadoop (e.g. 1.0.4),
Contributor Author

This hack is no longer needed, as we no longer support Hadoop 1.x.
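
For context, on Hadoop 2.x the stock FileOutputCommitter already honors a configuration flag for the _SUCCESS marker, so suppressing it no longer requires custom commitJob logic. A minimal illustration using the standard Hadoop API (not the PR's code):

```scala
import org.apache.hadoop.conf.Configuration

object SuccessMarkerSketch {
  // Setting this flag to false makes FileOutputCommitter skip writing _SUCCESS on Hadoop 2.x.
  def disableSuccessMarker(conf: Configuration): Unit =
    conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false)
}
```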


SparkQA commented Jan 10, 2017

Test build #71111 has finished for PR 16517 at commit 36c9269.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Jan 10, 2017

Test build #71122 has finished for PR 16517 at commit e208868.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

I manually tested a Hive table with a storage handler (a JDBC storage handler: https://github.com/qubole/Hive-JDBC-Storage-Handler), and it still works after this PR.


SparkQA commented Jan 13, 2017

Test build #71324 has finished for PR 16517 at commit 49af843.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


@transient private val sessionState = sqlContext.sessionState.asInstanceOf[HiveSessionState]
@transient private val externalCatalog = sqlContext.sharedState.externalCatalog
override protected def innerChildren: Seq[LogicalPlan] = query :: Nil
Member

+1

Let me see whether we can add such a test case to hit the bug without it.

Contributor Author

We can't. We only replace InsertIntoTable with InsertIntoHiveTable in the planner.

@@ -99,7 +99,7 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String)
}

private def getFilename(taskContext: TaskAttemptContext, ext: String): String = {
// The file name looks like part-r-00000-2dd664f9-d2c4-4ffe-878f-c6c70c1fb0cb_00003.gz.parquet
// The file name looks like part-00000-2dd664f9-d2c4-4ffe-878f-c6c70c1fb0cb_00003.gz.parquet
Member

The ext string always starts with c. Below is an example I got from a test case.

part-00000-fd8f3fdd-653a-4ea0-ab6d-5c8ad610b184-c000.snappy.parquet

Contributor Author

OK, I should update this string; c000 is the file count, which was added recently.
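
For illustration, a hedged sketch of how a file name of that shape could be assembled. The split of responsibilities between split number, jobId UUID, file count, and extension is assumed for the example and is not necessarily how getFilename divides them.

```scala
import java.util.UUID

object PartFileNameSketch {
  // Produces names of the form part-<split>-<jobId>-c<fileCount><ext>, e.g.
  // part-00000-fd8f3fdd-653a-4ea0-ab6d-5c8ad610b184-c000.snappy.parquet
  def partFileName(split: Int, jobId: String, fileCount: Int, ext: String): String =
    f"part-$split%05d-$jobId-c$fileCount%03d$ext"

  def main(args: Array[String]): Unit =
    println(partFileName(0, UUID.randomUUID().toString, 0, ".snappy.parquet"))
}
```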

mode == SaveMode.Ignore)
}

private def hasBeenPreprocessed(
Member

Also add a code comment for this func?

  /**
   * Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule
   * [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to
   * be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and
   * fix the schema mismatch by adding Cast.
   */
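
A minimal sketch of what such a check could look like, using the signature shown in the diff above; the body is simplified and illustrative, not necessarily the PR's exact logic.

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.StructType

object PreprocessCheckSketch {
  def hasBeenPreprocessed(
      tableOutput: Seq[Attribute],
      partSchema: StructType,
      partSpec: Map[String, Option[String]],
      query: LogicalPlan): Boolean = {
    // Partition keys normalized: every key in the spec names an actual partition column.
    val partColNames = partSchema.fieldNames.toSet
    val specNormalized = partSpec.keySet == partColNames
    // Casts added: the query output types line up with the columns the table expects
    // (data columns plus dynamically partitioned columns).
    val staticPartCols = partSpec.filter(_._2.isDefined).keySet
    val expectedColumns = tableOutput.filterNot(a => staticPartCols.contains(a.name))
    specNormalized && (expectedColumns.map(_.dataType) == query.output.map(_.dataType))
  }
}
```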

if (mode == SaveMode.Append || mode == SaveMode.Overwrite) {
throw new AnalysisException(
"CTAS for hive serde tables does not support append or overwrite semantics.")
}
Member

The code above needs to be synced with the latest master.

@gatorsmile
Member

Left a few comments. I am not 100% sure whether HiveFileFormat can completely replace the existing writer containers, but the other changes look good to me.


SparkQA commented Jan 16, 2017

Test build #71452 has finished for PR 16517 at commit e130e1c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please


SparkQA commented Jan 16, 2017

Test build #71461 has finished for PR 16517 at commit e130e1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
def hiveResultString(): Seq[String] = executedPlan match {
case ExecutedCommandExec(desc: DescribeTableCommand) =>
SQLExecution.withNewExecutionId(sparkSession, this) {
Contributor

Can you explain why SQLExecution.withNewExecutionId(sparkSession, this) is not needed here?

/**
* Returns true if the [[InsertIntoTable]] plan has already been preprocessed by analyzer rule
* [[PreprocessTableInsertion]]. It is important that this rule([[HiveAnalysis]]) has to
* be run after [[PreprocessTableInsertion]], to normalize the column names in partition spec and
Contributor

This rule is in the same batch as PreprocessTableInsertion, right? If so, we cannot guarantee that PreprocessTableInsertion will always fire on an InsertIntoTable before this rule does.

Contributor

Or do you mean that we use this function to determine whether PreprocessTableInsertion has fired?

Contributor

Should this function actually be part of the resolved method of InsertIntoTable?

Contributor

I think this version is good for now.

override def inferSchema(
sparkSession: SparkSession,
options: Map[String, String],
files: Seq[FileStatus]): Option[StructType] = None
Contributor

Is it safe to return None?

Contributor Author

Yes, because we are not going to use it in the read path.

Contributor

OK. Let's throw an exception here.
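
A hedged sketch of the suggestion, with a simplified standalone signature: since this format is write-only, the read-path entry point can fail fast instead of quietly returning None. The object name and exception message are assumptions, not the merged wording.

```scala
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

object WriteOnlyFormatSketch {
  // Read path is unsupported for this write-only format, so fail loudly.
  def inferSchema(
      sparkSession: SparkSession,
      options: Map[String, String],
      files: Seq[FileStatus]): Option[StructType] =
    throw new UnsupportedOperationException(
      "inferSchema is not supported for this write-only Hive file format.") // assumed message
}
```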

sparkSession: SparkSession,
job: Job,
options: Map[String, String],
dataSchema: StructType): OutputWriterFactory = {
Contributor

Do you want to add a comment noting the original source of the code in this function?

Contributor Author

@cloud-fan Jan 17, 2017

The preparation logic was scattered around before; I collected all of it and put it here.
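
A sketch, under assumptions, of the kind of preparation being referred to: the table's output format class and table properties are pushed into the Hadoop job configuration in one place before any writers are created. prepareJobConf and its parameters are illustrative names, not the PR's API.

```scala
import org.apache.hadoop.conf.Configuration

object WriteConfSketch {
  // Hypothetical helper collecting the scattered preparation into one call.
  def prepareJobConf(
      conf: Configuration,
      outputFormatClass: String,
      tableProperties: Map[String, String]): Unit = {
    conf.set("mapred.output.format.class", outputFormatClass)
    // e.g. serde properties and custom storage-handler settings
    tableProperties.foreach { case (k, v) => conf.set(k, v) }
  }
}
```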

context: TaskAttemptContext): OutputWriter = {
new HiveOutputWriter(path, fileSinkConfSer, jobConf.value, dataSchema)
}
}
Contributor

Should we just create a class instead of using an anonymous class?

Contributor Author

I followed other FileFormat implementations here.
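
For reference, the anonymous-subclass pattern being discussed, sketched with simplified stand-in types; the real OutputWriter/OutputWriterFactory in org.apache.spark.sql.execution.datasources carry TaskAttemptContext, schemas, and richer signatures.

```scala
object WriterFactorySketch {
  // Simplified stand-ins for illustration only.
  trait OutputWriter { def close(): Unit }
  abstract class OutputWriterFactory extends Serializable {
    def getFileExtension: String
    def newInstance(path: String): OutputWriter
  }

  def makeFactory(ext: String, mkWriter: String => OutputWriter): OutputWriterFactory =
    new OutputWriterFactory {          // anonymous subclass, as in other FileFormat implementations
      override def getFileExtension: String = ext
      override def newInstance(path: String): OutputWriter = mkWriter(path)
    }
}
```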

InsertIntoHiveTable(table, partSpec, query, overwrite, ifNotExists)

case CreateTable(tableDesc, mode, Some(query)) if DDLUtils.isHiveTable(tableDesc) =>
// Currently `DataFrameWriter.saveAsTable` doesn't support the Append mode of hive serde

dataSchema: StructType): OutputWriterFactory = {
val conf = job.getConfiguration
val tableDesc = fileSinkConf.getTableInfo
conf.set("mapred.output.format.class", tableDesc.getOutputFileFormatClassName)

val tableDesc = fileSinkConf.getTableInfo
conf.set("mapred.output.format.class", tableDesc.getOutputFileFormatClassName)

// Add table properties from storage handler to hadoopConf, so any custom storage

jobConf.value.getOutputFormat.asInstanceOf[HiveOutputFormat[AnyRef, Writable]]

override def getFileExtension(context: TaskAttemptContext): String = {
Utilities.getFileExtension(jobConf.value, fileSinkConfSer.getCompressed, outputFormat)


private def tableDesc = fileSinkConf.getTableInfo

private val serializer = {

serializer
}

private val hiveWriter = HiveFileFormatUtils.getHiveRecordWriter(

new Path(path),
Reporter.NULL)

private val standardOI = ObjectInspectorUtils


SparkQA commented Jan 18, 2017

Test build #71548 has finished for PR 16517 at commit 2f24c10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Add table properties from storage handler to hadoopConf, so any custom storage
// handler settings can be set to hadoopConf
HiveTableUtil.configureJobPropertiesForStorageHandler(tableDesc, conf, false)
Utilities.copyTableJobPropertiesToConf(tableDesc, conf)
Contributor

Will tableDesc be null?

Contributor Author

The tableDesc is created at https://github.com/apache/spark/pull/16517/files#diff-d579db9a8f27e0bbef37720ab14ec3f6R223, so it will never be null, and the previous null check was unnecessary.

// users that they may loss data if they are using a direct output committer.
val speculationEnabled = sqlContext.sparkContext.conf.getBoolean("spark.speculation", false)
val outputCommitterClass = jobConf.get("mapred.output.committer.class", "")
if (speculationEnabled && outputCommitterClass.contains("Direct")) {
Contributor

Do we still need this?

Contributor Author

@cloud-fan Jan 18, 2017

The direct committer has already been removed.

Contributor

It seems this change is unnecessary, and users may still use a direct output committer (they can still find the code on the Internet). Let's keep the warning.
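
A hedged sketch of the warning being kept, with illustrative names (DirectCommitterCheck and warnIfUnsafe are not Spark's): when speculation is on and the configured committer looks like a direct output committer, log a warning about possible data loss rather than failing.

```scala
import org.slf4j.LoggerFactory

object DirectCommitterCheck {
  private val log = LoggerFactory.getLogger(getClass)

  // Warn, rather than fail, when speculation plus a "Direct" committer could lose data.
  def warnIfUnsafe(speculationEnabled: Boolean, outputCommitterClass: String): Unit =
    if (speculationEnabled && outputCommitterClass.contains("Direct")) {
      log.warn("Speculation is enabled and a direct output committer is configured; " +
        "retried or speculative tasks may corrupt or lose output data.")
    }
}
```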

Contributor

yhuai commented Jan 18, 2017

Looks good to me. @gatorsmile, can you explain your concerns? I am wondering what kinds of cases you think HiveFileFormat may not be able to handle.

Member

gatorsmile commented Jan 18, 2017

No more questions after the latest changes. LGTM pending Jenkins


SparkQA commented Jan 18, 2017

Test build #71569 has finished for PR 16517 at commit 150efa2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HiveFileFormat(fileSinkConf: FileSinkDesc) extends FileFormat with Logging

@gatorsmile
Member

Thanks! Merged to master.

@asfgit closed this in 4494cd9 Jan 18, 2017